## The University of Alabama in Huntsville ECE Department CPE 431 01, CPE 531 01/01R Fall 2022 Advanced Pipelining

Due October 12, 2022 1.0(10), 2.0(10), 3.0.1(10), 3.0.2(10), 4.0(30)

4.7> This exercise is intended to help you understand the cost/complexity/performance trade-offs of forwarding in a pipelined processor. Problems in this exercise refer to pipelined datapaths from Figure 4.45. Thise problems assume that, of all the instructions executed in a processor, the following fraction of these instructions have a particular type of RAW data dependence. The type of RAW data dependence is identifies by the stage that produces the result (EX or MEM) and the instruction that consumes the result (1st instruction that follows the one that produces the result, 2nd instruction that follows, or both). We assume that the register write is done in the first half of the clock cycle and that register reads are done in the second half of the cycle so "EX to 3rd" and "MEM to 3rd" dependences are not counted because they cannot result in data hazards. Also, assume that the CPI of the processor is 1 if there are no data hazards.

| EX to 1 <sup>st</sup> | Memto 1 <sup>st</sup> | EX to 2 <sup>nd</sup> Only | MEMto 2 <sup>nd</sup> | EX to 1 <sup>st</sup> and | Other RAW   |
|-----------------------|-----------------------|----------------------------|-----------------------|---------------------------|-------------|
| Only                  | Only                  |                            | Only                  | MEM to 2 <sup>nd</sup>    | dependences |
| 10%                   | 25%                   | 5%                         | 10%                   | 15%                       | 10%         |

EX to 1st only example
 add \$t0, \$t1, \$t2
 add \$t3, \$t0, \$t4

MEM to 1st only example
 lw \$t0, 20(\$t1)
 add \$t3, \$t0, \$t4

EX to 2nd only example
 add \$t0, \$t1, \$t2
 add \$t3, \$t5, \$t4
 add \$t6, \$t0, \$s0

MEM to 2<sup>nd</sup> only example lw \$t0, 20(\$t1) add \$t3, \$t5, \$t4 add \$t6, \$t0, \$s0 EX to 1st and MEM to 2nd example \$t4, 24(\$t0) lw add \$t9, \$s0, \$s4 add \$s1, \$t4, \$t9 Other RAW Dependence example add \$t0, \$t1, \$t2 add \$s0, \$t3, \$s0 addi \$s1, \$s1, 4 add \$t3, \$t0, \$t4



If we use forwarding only from the MEM/WB pipeline register, what fraction of cycles are we stalling due to data hazards?

2.0 <4.8> This exercise is intended to help you understand the relationship between delay slots, control hazards, and branch exectution in a pipelined processor. In this exercise, we assume that the following MIPS code is executed on a pipelined processor with a 5-stage pipeline, full forwarding, and a predict-taken branch predictor:

```
lw
                r2, 0(r1)
                r2, r0, label2
label1:
         beq
                                    # not taken once, then taken
                r3, 0(r2)
         lw
                r3, r0, label1
         beq
                                    #taken
         add
                r1, r3, r1
label2:
         SW
                r1, 0(r2)
```

Draw the pipeline execution diagram for this code, assuming there are no delay slots and that branches execute in the EX stage.

3.0 <4.8> The importance of having a good branch predictor depends on how often conditional branches are executed. Together with branch predictor accuracy, this will determine how much time is spent stalling due to mispredicted branches. In this exercise, assume that the breakdown of dynamic instructions into various instruction categories is as follows:

| R-type | BEQ | JMP | LW  | SW  |
|--------|-----|-----|-----|-----|
| 45%    | 20% | 5%  | 20% | 10% |

Also, assume the following branch predictor accuracies:

| 0 1          |                  |       |
|--------------|------------------|-------|
| Always-Taken | Always-Not-Taken | 2-Bit |
| 40%          | 60%              | 80%   |

**3.0.1** Stall cycles due to mispredicted branches increase the CPI. What is the extra CPI due to mispredicted branches with the always-taken predictor? Assume that branch outcomes are determined in the EX stage, that there are no data hazards, and that no delay slots are used.

- **3.0.1** With the 2-bit predictor, what speedup would be achieved if we could convert half of the branch instructions in a way that replaces a branch instruction with an ALU instruction? Assume that correctly and incorrectly predicted instructions have the same chance of being replaced.
- 4.10> In this exercise, we consider the execution of a loop in a statically scheduled superscalar processor that has full forwarding. To simplify the exercise, assume that any combination of instruction types can execute in the same cycle, e.g., in a 3-issue superscalar, the three instructions can be three ALU operations, three branches, three load/store instruction, or any combination of these instructions. Note that this only removes a resource constraint, but data and control dependences must be still be handled correctly. Problems in this exercise refer to the following loop:

```
Loop: lw $t3, 0($s1)
lw $t4, 0($s2)
mul $t1, $t3, $t4
add $s0, $t1, $s0
addi $s1, $s1, -8
addi $s2, $s2, -8
bne $s1, $zero, Loop
```

Unroll this loop so that four iterations of it are done at once and schedule it for a 2-issue static superscalar processor. Assume that the loop always executes a number of iterations that is a multiple of 4. You can use any unused registers when changing the code to eliminate dependences.